Run Args¶
meta arguments¶
Now that you have a concept of the Function, Dataset and Runner, it’s time to talk about run_args. These are “extra” arguments that go alongside your function, but do not interact directly with it.
This can create some confusing terminology, so let’s be explicit. Whenever args are discussed, this refers to the actual Function arguments, i.e. what the Runner is storing. run_args deal with things like the remote directory and resource requests.
Native Arguments¶
Here we will cover the run_args that are natively understood by a run. While you can implement your own functionality (more on that later), the following arguments are common to all runs.
Skip & Force¶
This is a special “contextual” run arg. There are three situations where these args are relevant.
Dataset init¶
By default, when defining Dataset(...), a search is done to see if a matching Dataset has already been created. If this is the case, the current creation will be “skipped”, and the Dataset will instead be unpacked from the previous state.
Setting skip=False will ensure a new Dataset is created, deleting the old database in the process.
force=True is ignored here; only skip has any function.
Note
It is advised to use Dataset(..., skip=False) while testing, as it ensures consistent behaviour. Only once you care about the result should you drop this argument (or change it to True).
Run append¶
Any Runner that already exists cannot be added to a Dataset.
With skip=False, the runner will be appended anyway. This does not overwrite the existing runner, and allows for multiple copies of the same run.
force=True acts as an inverted alias of skip, i.e. skip=False == force=True.
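This equivalence can be sketched in plain Python (an illustration of the aliasing only, not remotemanager’s implementation):

```python
def effective_skip(skip=True, force=False):
    """Sketch: force=True is simply an inverted alias of skip."""
    return skip and not force

# both spellings produce the same behaviour
print(effective_skip(skip=False))  # False
print(effective_skip(force=True))  # False
```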
Run()¶
When running a Dataset, is_finished is called to get the states of any runners. Any that are already running or have completed will not be submitted.
skip=False allows runners which are already submitted to be resubmitted.
In general, force=True functions as an inverted alias of skip. However, there is an additional keyword argument, force_ignores_success, which is required to resubmit runners considered as “succeeded”. This is an extra safeguard against overwriting data.
Important
force_ignores_success is required for skip=False/force=True to function on runners which are considered to have succeeded. A runner is considered to have succeeded once it has successfully returned a result file.
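The submission decision described above can be sketched in plain Python. This is an illustration of the logic, not remotemanager’s actual implementation; the `is_running` and `succeeded` flags are hypothetical stand-ins for the runner states reported by `is_finished`:

```python
def should_submit(is_running, succeeded, skip=True, force=False,
                  force_ignores_success=False):
    """Sketch of the Run() decision: True if the runner is (re)submitted."""
    if force:  # force=True is an inverted alias of skip
        skip = False
    if succeeded and not force_ignores_success:
        # extra safeguard: succeeded runners are never overwritten
        return False
    if skip and is_running:
        # already submitted, and we are allowed to skip it
        return False
    return True

# a fresh runner is always submitted
print(should_submit(is_running=False, succeeded=False))             # True
# a running runner is skipped by default, resubmitted with skip=False
print(should_submit(is_running=True, succeeded=False))              # False
print(should_submit(is_running=True, succeeded=False, skip=False))  # True
# a succeeded runner additionally needs force_ignores_success
print(should_submit(is_running=False, succeeded=True, skip=False))  # False
print(should_submit(is_running=False, succeeded=True, skip=False,
                    force_ignores_success=True))                    # True
```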
Dirs¶
The most commonly set run_args are the *_dir family. These designate where your run files will end up, and it is recommended to change them from the defaults when doing a full run. remotemanager can create a lot of small files, which can make directory navigation cumbersome, even with proper segmentation.
local_dir¶
This directory is on your machine, and dictates where the runners will “stage” from. When running, files are first written to this directory then sent to the remote.
remote_dir¶
This directory is the main one on the remote machine, and is where all the main run files are copied to.
run_dir¶
This directory is not always used. It exists within the remote_dir, and is where the run will actually be executed.
Warning
Be careful using run_dir with runs that need to interact with the file system. For example, when sending extra files, you will need to access them via a relative path such as ../file.
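The layout can be illustrated with plain paths (the directory names here are the defaults shown later in this tutorial; the run_dir name is hypothetical):

```python
from pathlib import PurePosixPath

remote_dir = PurePosixPath("temp_runner_remote")  # main run files land here
run_dir = remote_dir / "my_run"                   # execution happens in here

# an extra file sent alongside the run sits in remote_dir,
# so from within run_dir it must be reached one level up
extra_file = remote_dir / "data.txt"
path_from_run_dir = PurePosixPath("..") / extra_file.name
print(path_from_run_dir)  # ../data.txt
```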
Run modifiers¶
Asynchronous¶
True by default; ensures that runs are executed in parallel. Set to False to force a dataset to execute its runners one after another (only functions when submitter="bash").
Argument Hierarchy¶
run_args can be set at multiple levels.
Dataset - This is the “top level” storage; all runners inherit from this dictionary
Runner - Runners can have their own “local” run_args, just for that run
Run/Temporary - when running a Dataset, you can also pass arguments into the run. These are considered “temporary” arguments, and will be dropped after the run completes.
Let’s demonstrate what this looks like, starting with the defaults:
[1]:
from remotemanager import Dataset
def function(inp):
    return inp
# skip=False will be used heavily throughout the tutorials
# it is recommended that you also do so when experimenting
ds = Dataset(function, skip=False)
ds.append_run({"inp": 1})
appended run runner-0
[2]:
print(ds.run_args)
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote'}
The defaults here mean:
Runs will try to skip (if they already have results)
Runs will not be forced
Jobs will be run asynchronously
The local staging directory is temp_runner_local
The remote running directory is temp_runner_remote
By comparison, the Runner object will appear to have no run_args, since these are “overrides” that are set at the runner level.
[3]:
print(ds.runners[0].run_args)
{}
When running a job, these arguments are combined into a single dictionary. This can be seen at derived_run_args:
[4]:
print(ds.runners[0].derived_run_args)
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote'}
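The combination is a layered merge in which runner-level values take precedence. It can be sketched with plain dictionaries (an illustration only, not the library’s internals):

```python
dataset_args = {"skip": True, "local_dir": "temp_runner_local"}
runner_args = {}  # runner-level overrides, empty by default

# runner values take precedence over the Dataset values
derived = {**dataset_args, **runner_args}
print(derived)  # {'skip': True, 'local_dir': 'temp_runner_local'}

# setting a runner-level value overrides the Dataset value in the merge
runner_args["local_dir"] = "special_dir"
derived = {**dataset_args, **runner_args}
print(derived["local_dir"])  # special_dir
```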
Setting run_args¶
Now that we know the default values, how do we change them?
Firstly, any argument passed to Dataset, append_run, or run() that is not part of those functions will be treated as a run_arg. However, you can also update them after initialisation.
There are multiple ways to update or set args. The most obvious is to directly update the run_args dictionaries, but there are also functions that can do this more “explicitly”.
Note
These functions exist on both Dataset and Runner.
Let’s start by demonstrating a direct method:
[5]:
ds.run_args["direct"] = True
for k, v in ds.run_args.items():
    print(k, v)
skip True
force False
asynchronous True
local_dir temp_runner_local
remote_dir temp_runner_remote
direct True
set_run_args¶
This function takes a list of keys and a list of values, and sets them. You can also pass a single (key, val) pair.
[6]:
ds.set_run_args(["a", "b", "c"], [1, 2, 3])
ds.set_run_args("d", 4)
for k, v in ds.run_args.items():
    print(k, v)
skip True
force False
asynchronous True
local_dir temp_runner_local
remote_dir temp_runner_remote
direct True
a 1
b 2
c 3
d 4
update_run_args¶
This function takes a dictionary of arguments and updates the inner run_args with it. Useful for setting a large set of arguments at once.
[7]:
ds.update_run_args({"a": 10, "b": 11, "c": 12, "d": 13})
for k, v in ds.run_args.items():
    print(k, v)
skip True
force False
asynchronous True
local_dir temp_runner_local
remote_dir temp_runner_remote
direct True
a 10
b 11
c 12
d 13
Custom run_args¶
Unhandled run_args will be ignored by a run. However, if you are using a Computer that accepts arguments for its script() method, they can be used there.
The main use for this dynamic ability is scheduler resources, which is covered in depth within the Scheduler Tutorial.
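As a sketch of the idea: custom run_args such as `mpi` or `time` can be forwarded as keyword arguments to a script-building method. The `script()` signature below is hypothetical (see the Scheduler Tutorial for the real interface):

```python
def script(mpi=1, omp=1, time="00:10:00", **unused_run_args):
    """Hypothetical Computer.script() consuming custom run_args
    to build scheduler directives."""
    return "\n".join([
        f"#SBATCH --ntasks={mpi}",
        f"#SBATCH --cpus-per-task={omp}",
        f"#SBATCH --time={time}",
    ])

# custom run_args would be forwarded as keyword arguments
print(script(mpi=4, time="01:00:00"))
```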
Runner overrides¶
The run_args of a Runner act as “local” overrides for whatever is set in the Dataset.
We can demonstrate this by setting a value on the runner.
[8]:
ds.runners[0].run_args["d"] = "foo"
print("Dataset args:", ds.run_args.get("d", None))
print("Runner args:", ds.runners[0].run_args.get("d", None))
print("Derived args:", ds.runners[0].derived_run_args.get("d", None))
Dataset args: 13
Runner args: foo
Derived args: foo
At the Dataset level, the value of d is still 13. However, on the runner where we overrode the value, it is now “foo”.
Any other runners will retain the Dataset-level value.
[9]:
ds.append_run({"inp": 2})
print("Derived args:", ds.runners[1].derived_run_args.get("d", None))
appended run runner-1
Derived args: 13
Temporary Run() Args¶
As mentioned previously, you can also pass the same args to the Run() call of a Dataset. While difficult to demonstrate here, you can verify it by setting a remote_dir and then updating that arg within Run().
Args set this way are discarded after the run, and are considered “temporary” by the Dataset, whereas args set any other way are saved when the Dataset is.
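The behaviour can be sketched as a merge that never writes back (an illustration only, not the library’s internals):

```python
stored_args = {"remote_dir": "temp_runner_remote"}  # saved with the Dataset
temporary = {"remote_dir": "production_remote"}     # passed to run()

# temporary args win for this run...
effective = {**stored_args, **temporary}
print(effective["remote_dir"])    # production_remote

# ...but the stored args are untouched afterwards
print(stored_args["remote_dir"])  # temp_runner_remote
```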